feat(hackathon): add benchmarking strandly script by lizradway · Pull Request #1087 · strands-agents/sdk-typescript

lizradway · 2026-05-19T17:42:53Z

Description

Adds a strandly benchmark command that runs Strands agents against ContextBench — a code investigation benchmark that measures how well an agent finds relevant code for real GitHub issues.

The benchmark:

Loads tasks from ContextBench's gold-annotated dataset (parquet files with file/symbol/span annotations)
Runs a Strands agent (Bedrock Claude by default) with a bash tool to investigate the target repo
Evaluates the agent's trajectory against gold annotations via ContextBench's Python evaluation
Reports file/symbol/span coverage and precision metrics
Optionally emits metrics to CloudWatch for trending

Includes 6 built-in configs testing different context management strategies (control, offloader, offloader-aggressive, summarizing, sliding-proactive, offloader-summarizing); these will be updated to use the built-in context management strategies. Also adds support for custom agent files via --agent-file, a configurable model via --model, and a --min-coverage flag for future CI gating.

Usage:

strandly benchmark --suite contextbench
strandly benchmark --suite contextbench --agent-file ./my-agent.ts --cloudwatch

Related Issues

N/A

Documentation PR

N/A — README included at strandly/src/benchmark/README.md

Type of Change

New feature

Testing

How have you tested the change?

Ran end-to-end with Bedrock Claude Sonnet 4 on the django__django-15987 task; achieved 100% file coverage
Verified CloudWatch metric emission
Verified --agent-file custom config loading
Verified --min-coverage threshold gating
Verified error handling (missing API keys, invalid model IDs, timeouts)
TypeScript type-checks clean (npx tsc --noEmit --project strandly/tsconfig.json)
I ran npm run check

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2026-05-21T18:02:22Z

Assessment: Comment

This is a well-structured benchmarking tool for the strandly CLI. The code is organized cleanly with good separation of concerns (loader, runner, evaluator, reporter, cloudwatch). The types are well-defined and the README documentation is thorough.

Review Categories

Testing: No unit tests are included for any of the pure functions (trajectory extraction, report generation, metric parsing). Even for hackathon/prototype code, the regex-heavy trajectory logic would benefit from test coverage.
Code Hygiene: Unused zod dependency, unused ROOT constant, and .cache/ directory not gitignored.
Composability: process.exit() calls in library functions make the code harder to test and reuse — throwing errors and handling at the CLI boundary would be more idiomatic.
Robustness: Timer leak in Promise.race timeout pattern, and no validation of --min-coverage input.
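
A minimal sketch of the Composability and --min-coverage points above: library code throws, and only the CLI entry point turns a failure into an exit code. `BenchmarkUsageError`, `parseMinCoverage` and `runCli` are hypothetical names for illustration, and the sketch assumes the threshold is a fraction in [0, 1], which the PR does not state.

```typescript
// Hypothetical error type; not part of the PR.
class BenchmarkUsageError extends Error {}

// Assumption: --min-coverage is a fraction in [0, 1]. Adjust if it is a percentage.
function parseMinCoverage(raw: string): number {
  const value = Number(raw)
  if (raw.trim() === '' || !Number.isFinite(value) || value < 0 || value > 1) {
    throw new BenchmarkUsageError(`--min-coverage must be a number between 0 and 1, got "${raw}"`)
  }
  return value
}

// CLI boundary: the only place that maps errors to an exit code.
// The caller would pass the returned code to process.exit().
function runCli(args: { minCoverage: string }): number {
  try {
    const threshold = parseMinCoverage(args.minCoverage)
    console.log(`Gating on coverage >= ${threshold}`)
    return 0
  } catch (error) {
    if (error instanceof BenchmarkUsageError) {
      console.error(error.message)
      return 2 // usage error
    }
    throw error // unexpected failures still surface with a stack trace
  }
}
```

Because `parseMinCoverage` never exits the process, it can be unit-tested directly with invalid inputs.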

Nice addition to the developer tooling — the integration with ContextBench's evaluation framework and the CloudWatch metrics emission for trending are particularly useful.

agent-of-mkmeral · 2026-05-28T20:23:32Z

Really nice addition — clean separation (loader/runner/evaluator/reporter/cloudwatch), thorough README, and integrating ContextBench is exactly the right call. Since it's a draft devtool I skipped style nits and focused on bugs that would make the benchmark numbers wrong, because that's what makes a benchmark misleading rather than just rough. I checked out the branch locally and tested the trajectory parser empirically.

TL;DR: three issues in the measurement layer systematically bias the cross-config comparison the tool exists to make. None block merging as a prototype — but I'd hold off publishing the numbers as a strategy comparison until #1–#3 are addressed.

🔴 Critical (these change the scores)

1. Trajectory is reconstructed from agent.messages after the run — but conversation managers mutate that array in place.
When the run is long enough to trigger context reduction (exactly what these strategies target), the early toolUse blocks are already gone by extraction time, so the agent's early file reads aren't counted. Importantly, ContextOffloader only rewrites toolResult content and leaves toolUse intact, while SlidingWindow/Summarizing splice messages out — so offloader configs keep their full trajectory and the others don't. That's a structural advantage unrelated to investigation quality.

2. The bash path regex misses most common investigation commands.
grep/rg/ls/find/awk contribute zero files, head -n/tail -n lose the file, multi-file cat keeps only the first, and less +100 file.py injects a garbage path. Coverage ends up reflecting the model's bash phrasing more than what it actually found.

3. Symbol / Span / EditLoc metrics are structurally always 0.000 (spans/symbols hardcoded empty in evaluator.ts) but the report table + README present them as real measurements.

Details, evidence & suggested fixes for #1–#3

#1 — post-run trajectory extraction vs. in-place mutation

// runner.ts — runs AFTER agent.invoke() completes
const trajectory = extractTrajectory(agent.messages, repoDir)

SlidingWindowConversationManager → messages.splice(0, trimIndex) (sliding-window-conversation-manager.ts:268) removes oldest messages.
SummarizingConversationManager → messages.splice(0, messagesToSummarizeCount, summaryMessage) (summarizing-conversation-manager.ts:173) replaces oldest ~30% with a summary.
ContextOffloader → only edits toolResult content via AfterToolCallEvent; toolUse blocks survive (plugin.ts).

Net: coverage is under-counted for compressing configs and under-counted more the more aggressively they compress — so the comparison is apples-to-oranges. Suggested fix: capture tool calls live with a BeforeToolCallEvent/AfterToolCallEvent hook into a side list, instead of reconstructing from the post-run message array. This is the single highest-impact change.
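
To illustrate why a side list is robust here, a self-contained sketch follows. `ToolCallHooks`, `onBeforeToolCall` and `runSimulatedAgent` are hypothetical stand-ins, not the Strands SDK API; the real hook registration may look different. The simulated loop trims its message array in place the way a sliding-window manager would.

```typescript
// Hypothetical stand-ins for illustration only; not the Strands SDK types.
type ToolUse = { name: string; input: { command: string } }
type Message = { role: 'assistant' | 'user'; toolUse?: ToolUse }

// Minimal hook registry standing in for a BeforeToolCallEvent subscription.
class ToolCallHooks {
  private listeners: Array<(toolUse: ToolUse) => void> = []
  onBeforeToolCall(listener: (toolUse: ToolUse) => void): void {
    this.listeners.push(listener)
  }
  emit(toolUse: ToolUse): void {
    for (const listener of this.listeners) listener(toolUse)
  }
}

// Simulates an agent loop whose conversation manager trims old messages in place.
function runSimulatedAgent(commands: string[], hooks: ToolCallHooks, maxMessages: number): Message[] {
  const messages: Message[] = []
  for (const command of commands) {
    const toolUse: ToolUse = { name: 'bash', input: { command } }
    hooks.emit(toolUse) // fires when the tool is called, before any trimming
    messages.push({ role: 'assistant', toolUse })
    if (messages.length > maxMessages) {
      messages.splice(0, messages.length - maxMessages) // sliding-window style trim
    }
  }
  return messages
}

const hooks = new ToolCallHooks()
const liveTrajectory: string[] = []
hooks.onBeforeToolCall((toolUse) => liveTrajectory.push(toolUse.input.command))

const finalMessages = runSimulatedAgent(['cat a.py', 'cat b.py', 'cat c.py'], hooks, 2)

// Post-run reconstruction only sees what survived the trim.
const postRunTrajectory: string[] = []
for (const message of finalMessages) {
  if (message.toolUse) postRunTrajectory.push(message.toolUse.input.command)
}

console.log(liveTrajectory.length, postRunTrajectory.length) // 3 2
```

The live list keeps all three reads, while the post-run reconstruction has already lost `cat a.py`.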

#2 — regex parser (tested locally against extractFilePathsFromToolCall)

Command	Extracted	Should be
`head -n 50 forms/models.py`	`[]`	models.py
`tail -n 100 file.py`	`[]`	file.py
`cat a.py b.py c.py`	`[a.py]`	all three
`grep -rn 'def clean' forms/`	`[]`	(search hits)
`rg 'class ModelForm'`	`[]`	(search hits)
`less +100 file.py`	`[+100]` ⚠️	file.py
`cat ./forms/models.py`	`./forms/...`	forms/...

The -n flag is captured as the path then dropped by the startsWith('-') guard, so the file is silently lost. Suggested fix: strip flags (-n, -50, +100) before treating tokens as paths, handle multiple files per command, and capture grep/rg/find targets — or better, have the agent read through a structured tool you can parse reliably rather than reverse-engineering bash.
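
A rough sketch of the flag-stripping idea, under stated assumptions: `extractReadPaths` and `FILE_READERS` are hypothetical names (not the PR's `extractFilePathsFromToolCall`), commands are split on whitespace only, and quoting, pipes, `&&` chains and search tools like grep/rg/find are deliberately out of scope.

```typescript
// Readers whose non-flag arguments are file paths, mapped to the short flags
// that take a separate value (e.g. `head -n 50`). Illustrative, not exhaustive.
const FILE_READERS = new Map<string, Set<string>>([
  ['cat', new Set<string>()],
  ['less', new Set<string>()],
  ['head', new Set(['-n', '-c'])],
  ['tail', new Set(['-n', '-c'])],
])

function extractReadPaths(command: string): string[] {
  const [program, ...args] = command.trim().split(/\s+/)
  const valueFlags = program === undefined ? undefined : FILE_READERS.get(program)
  if (valueFlags === undefined) return [] // not a known reader (grep, rg, find, ...)

  const paths: string[] = []
  for (let i = 0; i < args.length; i++) {
    const token = args[i]
    if (token === undefined) continue
    if (valueFlags.has(token)) {
      i++ // also skip the flag's value, e.g. the 50 in `-n 50`
      continue
    }
    if (token.startsWith('-') || token.startsWith('+')) continue // -50, +100, --flag
    paths.push(token.replace(/^\.\//, '')) // drop a leading ./
  }
  return paths
}

console.log(extractReadPaths('head -n 50 forms/models.py')) // [ 'forms/models.py' ]
console.log(extractReadPaths('cat a.py b.py c.py')) // [ 'a.py', 'b.py', 'c.py' ]
```

Keeping the value-taking flags per command matters: `-n` consumes the next token for head/tail but not for cat.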

#3 — always-zero metrics

runner.ts calls evaluate(task, fileList) with no spans, and evaluator.ts hardcodes spans: spans ?? {}, symbols: {}, so symbol/span/editloc are always 0. The reporter.ts table and README still list them as real metrics. Suggested fix: either populate spans/symbols (the TrajectoryEntry already has startLine/endLine fields) or drop those rows from the report + README until they're wired up.

🟠 Worth addressing for reproducibility

#4 nondeterminism · #5 path normalization

4. BedrockModel({ modelId, stream: false }) doesn't set temperature, and each config runs once on a single task. Run-to-run sampling variance will look like config differences (especially in CloudWatch trends). Consider temperature: 0, and ideally multiple seeds/tasks with reported variance — otherwise a 5–10% swing is within noise.

5. toRelativePath doesn't strip ./ or resolve relative paths, so ./foo.py or a bare path after cd won't match gold foo.py — counts as both a miss and a false positive. Normalize to repo-relative POSIX paths before comparing.
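
For #5, one possible normalization is sketched below. `toRepoRelative` is a hypothetical helper, not the PR's `toRelativePath`, and it assumes POSIX-style paths plus a caller that tracks the agent's working directory after `cd`.

```typescript
// Resolves `rawPath` against `cwd` and returns it relative to `repoDir`,
// or null when the path is the repo root itself or falls outside the repo.
function toRepoRelative(rawPath: string, repoDir: string, cwd: string = repoDir): string | null {
  const absolute = rawPath.startsWith('/') ? rawPath : `${cwd}/${rawPath}`
  const parts: string[] = []
  for (const segment of absolute.split('/')) {
    if (segment === '' || segment === '.') continue // collapse // and ./
    if (segment === '..') parts.pop()
    else parts.push(segment)
  }
  const normalized = '/' + parts.join('/')
  const root = repoDir.replace(/\/+$/, '') // tolerate a trailing slash on repoDir
  return normalized.startsWith(root + '/') ? normalized.slice(root.length + 1) : null
}

console.log(toRepoRelative('./foo.py', '/repo')) // foo.py
console.log(toRepoRelative('models.py', '/repo', '/repo/forms')) // forms/models.py
```

Applying the same function to both the extracted paths and the gold paths keeps the comparison symmetric.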

🟡 Minor (fine for a devtool)

#6 timer leak · #7 token definition · #8 double checkout

6. runner.ts clears timeoutId on the success path but not in the catch, leaving the 10-min timer pending on errors (also flagged by the bot).
7. CloudWatch TokenUsage = input+output, but the README/reporter frame "Input Tokens" as the cost proxy — pick one definition so trend lines stay consistent.
8. git checkout A || git fetch && git checkout A parses as (checkout || fetch) && checkout, so the happy path checks out twice. Harmless, just wasteful.
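
For #6, a small sketch of a timeout wrapper that clears the timer on every path. `withTimeout` is a hypothetical helper name; the actual structure of runner.ts may differ.

```typescript
// Races `work` against a timeout and always clears the timer, on success or failure.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timeoutId: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timeoutId = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    clearTimeout(timeoutId) // runs on both the success and the error path
  }
}
```

Putting `clearTimeout` in `finally` instead of after the `await` means a rejected `work` promise no longer leaves the 10-minute timer pending.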

Great foundation overall — fixing #1 (event-hook trajectory capture) and #2 (flag/multi-file/grep handling) would make the cross-config comparison trustworthy, and #3 is a quick "implement or hide" decision. Happy to help if useful!

Partial fix to strands-agents#1069 - previously the agent would prematurely exit if the agent generated a tool with an invalid name; this avoids that by ensuring the agent loop continues with zero tool-uses. --------- Co-authored-by: Mackenzie Zastrow <zastrowm@users.noreply.github.com>

feat(hackathon): add benchmarking strandley script

0eeb389

lizradway force-pushed the benchmark branch from 11debfc to 0eeb389 Compare May 19, 2026 17:48
